Multitenant migrations execution #431

fenos · 2024-01-29T22:38:28Z

What kind of change does this PR introduce?

Feature

What is the current behavior?

Currently, migrations on tenants are run on the first request from when the server starts.
With this approach, there are a few different drawbacks:

Expensive migrations (ex adding an index) cannot be run as this impacts the response time
Build up connections to the database when concurrent requests come in

What is the new behavior?

This PR introduces 3 migrations strategy modes:

on_request
full_fleet
progressive

Migrations are now tracked in the multi-tenant db with migrations_version and migrations_status.
In all modes, migrations will be run only if a tenant is not on the latest version.

On Request

on_request mode will run the migrations to a tenant database during a request. Migrations will block the request lifecycle until completed.

In case the migration fails, the request will also fail with a 500 error.

Full Fleet

This mode offers a background scheduler to run migrations across all tenants asynchronously.

On Server Start, the instances race to acquire an advisory lock.
The instance that acquires the lock is responsible for selecting all tenants that need a migration to run.

We emit RunMigrationsOnTenant job for each tenant in a batch, we use a singleton key to make sure only 1 job for this specific tenant can exist at any one time.

Each Storage instance can consume the events and run the migrations on each single tenant database.
Migrations are retried automatically 3 times.

Failures don't impact the request lifecycle and the migrations will run on all tenants that are not on the latest version.

Progressive

This mode is a hybrid version of the 2 above. It works similarly to the full_fleet mode since migrations are run with an async job, but the tenants are selected during a request.

During a tenant request, we determine if the tenant needs a migration to run.
If it does we register the tenantId in a list in memory.

A background worker will pop a batch of 200 tenants from this list and schedule RunMigrationsOnTenant jobs every 1m.

Deduping between instances is done via the singletonKey and early checks.

In case of failure, migrations are retried 3 times, after which the status becomes FAILED_STALE.
When using progressive migrations and the staus is FAILED_STALE we will continue running the migrations during the request lifecycle without blocking the request for this tenant, with the hope that it recovers from unexpected prolonged issues to the underline database.

Additional context

With this change, we will see the following benefits:

Increased performance on req/res time on server restart
Fewer database connections to tenant databases
Possibility to run long-running async migrations

coveralls · 2024-01-30T11:11:05Z

Pull Request Test Coverage Report for Build 7842314553

-676 of 1052 (35.74%) changed or added relevant lines in 33 files are covered.
17 unchanged lines in 5 files lost coverage.
Overall coverage decreased (-3.7%) to 78.602%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
src/http/routes/health/healthcheck.ts	0	1	0.0%
src/queue/workers.ts	1	2	50.0%
src/storage/tus/postgres-locker.ts	0	1	0.0%
src/pubsub/postgres.ts	13	15	86.67%
src/monitoring/logger.ts	21	24	87.5%
src/database/pubsub.ts	3	8	37.5%
src/concurrency/mutex.ts	7	28	25.0%
src/http/plugins/log-request.ts	5	26	19.23%
src/http/routes/admin/migrations.ts	17	39	43.59%
src/http/routes/admin/tenants.ts	40	86	46.51%

Files with Coverage Reduction	New Missed Lines	%
src/monitoring/logger.ts	1	90.59%
src/queue/workers.ts	1	37.5%
src/queue/queue.ts	2	15.05%
src/database/tenant.ts	3	49.66%
src/storage/errors.ts	10	67.97%

Totals
Change from base Build 7842206814:	-3.7%
Covered Lines:	8197
Relevant Lines:	10369

💛 - Coveralls

fenos · 2024-02-05T13:51:52Z

Dockerfile

@@ -1,24 +1,36 @@
-FROM node:18-alpine
+# Base stage for shared environment setup


Improved Dockerfile to use node_modules cache on each step

fenos · 2024-02-05T13:52:54Z

.env.sample

 #######################################
 # MULTI_TENANT=true
 DATABASE_MULTITENANT_URL=postgresql://postgres:[email protected]:5433/postgres
-REQUEST_X_FORWARDED_HOST_REGEXP=
+REQUEST_X_FORWARDED_HOST_REGEXP=^([a-z]{20}).local.(?:com|dev)$


default value for multitenant x-forwarded-host: {tenant-id}.local.dev to make it easy to run Storage out of the box as multitenant

fenos · 2024-02-05T13:54:12Z

.env.sample

@@ -85,7 +78,7 @@ STORAGE_BACKEND=s3
 #######################################
 # S3 Backend
 #######################################
-STORAGE_S3_BUCKET=name-of-your-s3-bucket
+STORAGE_S3_BUCKET=supa-storage-bucket


default value for bucket name which is the same as the bucket we create on the docker-compose.
This makes it easy to simply rename the .env.sample to .env without any additional changes

fenos · 2024-02-05T13:54:46Z

README.md

@@ -23,9 +23,10 @@ A scalable, light-weight object storage service.

 - Copy `.env.sample` to `.env` file.
 - Copy `.env.test.sample` to `.env.test`.
- Change `GLOBAL_S3_BUCKET` and `REGION` to the name and region of a S3 bucket.


Now simply rename .env.sample to .env and that's it, all default values will work out of the box

fenos · 2024-02-05T13:59:46Z

src/config.ts

@@ -258,8 +261,9 @@ export function getConfig(options?: { reload?: boolean }): StorageConfigType {
    logflareEnabled: getOptionalConfigFromEnv('LOGFLARE_ENABLED') === 'true',
    logflareApiKey: getOptionalConfigFromEnv('LOGFLARE_API_KEY'),
    logflareSourceToken: getOptionalConfigFromEnv('LOGFLARE_SOURCE_TOKEN'),
-    defaultMetricsEnabled:
-      getOptionalConfigFromEnv('DEFAULT_METRICS_ENABLED', 'ENABLE_DEFAULT_METRICS') === 'true',
+    defaultMetricsEnabled: !(


default metrics are now on by default

fenos · 2024-02-05T14:03:56Z

src/database/migrate.ts

@@ -346,7 +396,7 @@ async function refreshMigrationPosition(
  if (shouldUpdateMigrations) {
    await client.query(`BEGIN`)
    try {
-      await client.query(`DELETE FROM ${migrationTableName}`)
+      await client.query(`DELETE FROM ${migrationTableName} WHERE id is not NULL`)


running deletes without where statements seems to be blocked by default on staging / prod with a pg extension

fenos · 2024-02-05T14:05:52Z

src/http/routes/tenant/migrations.ts

+    if (!pgQueueEnable) {
+      return reply.code(400).send({ message: 'Queue is not enabled' })
+    }
+    const queueSize = await Queue.getInstance().getQueueSize(RunMigrationsOnTenants.getQueueName())


admin route to see how many remaining migrations are pending at: GET /admin/migrations

payload:

{ "remaining": 10 }

fenos · 2024-02-05T14:06:58Z

src/monitoring/logger.ts

+    const is4xxResponse = statusOfType(log.res.statusCode, 400)
+    const is5xxResponse = statusOfType(log.res.statusCode, 500)
+
+    const logLevel = is4xxResponse ? 'warn' : is5xxResponse ? 'error' : 'info'


Thanks to @egor-romanov to have pointed out that request logs were always info.
This changes the log level:

anything 4xx is a warning

anything 5xx is an error

fenos · 2024-02-05T14:07:42Z

src/queue/events/base-event.ts

    return {}
  }

+  static batchSend<T extends BaseEvent<any>[]>(messages: T) {


Implemented batch insert for inserting multiple jobs in one insert statement

fenos · 2024-02-05T14:07:59Z

src/queue/events/run-migrations.ts

+  tenantId: string
+}
+
+export class RunMigrationsOnTenants extends BaseEvent<RunMigrationsPayload> {


Job that handles tenant migrations

fenos · 2024-02-05T14:09:37Z

src/server.ts

@@ -28,6 +33,20 @@ const exposeDocs = true
  if (isMultitenant) {
    await runMultitenantMigrations()
    await listenForTenantUpdate(PubSub)
+
+    runMigrationsOnAllTenants()


on server start up, all storage api instances will race to acquire the lock.
The one who acquired the lock will create the jobs for running the migrations, the other instances who failed will simply return and do nothing.

inian · 2024-02-09T10:31:43Z

src/database/migrations/migrate.ts

+    return
+  }
+  const result = await multitenantKnex.raw(`SELECT pg_try_advisory_lock(?);`, [
+    '-8575985245963000605',


whats does this number represent?

the lock id

github-actions · 2024-02-09T13:22:55Z

🎉 This PR is included in version 0.48.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

feat: async tenant migrations

6ba2add

fenos force-pushed the multitenant/migrations-execution branch 2 times, most recently from 9075ee7 to 30370d1 Compare January 30, 2024 11:08

fenos force-pushed the multitenant/migrations-execution branch from 30370d1 to 5e47df6 Compare January 30, 2024 12:03

rebase

b8e74aa

fenos force-pushed the multitenant/migrations-execution branch 2 times, most recently from d350181 to 687e60e Compare February 5, 2024 13:50

fenos commented Feb 5, 2024

View reviewed changes

fenos force-pushed the multitenant/migrations-execution branch 11 times, most recently from 707b53f to 186f5e8 Compare February 8, 2024 16:54

feat: monitoring with grafana & prometheus

e291a85

fenos force-pushed the multitenant/migrations-execution branch from 186f5e8 to e291a85 Compare February 9, 2024 09:38

fix: rebase master

5ba7cc1

fenos force-pushed the multitenant/migrations-execution branch from ca4dfb8 to 5ba7cc1 Compare February 9, 2024 09:56

fenos mentioned this pull request Feb 9, 2024

pgboss on error event handler #430

Closed

inian approved these changes Feb 9, 2024

View reviewed changes

fenos merged commit 83ad222 into master Feb 9, 2024

fenos deleted the multitenant/migrations-execution branch February 9, 2024 10:41

github-actions bot added the released label Feb 9, 2024

		@@ -1,24 +1,36 @@
		FROM node:18-alpine
		# Base stage for shared environment setup

Uh oh!

Multitenant migrations execution #431

Multitenant migrations execution #431

Uh oh!

Conversation

fenos commented Jan 29, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What kind of change does this PR introduce?

What is the current behavior?

What is the new behavior?

On Request

Full Fleet

Progressive

Additional context

Uh oh!

coveralls commented Jan 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pull Request Test Coverage Report for Build 7842314553

💛 - Coveralls

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

fenos Feb 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Feb 9, 2024

Uh oh!

Uh oh!

fenos commented Jan 29, 2024 •

edited

Loading

coveralls commented Jan 30, 2024 •

edited

Loading

fenos Feb 5, 2024 •

edited

Loading